Integration of persistent Laplacian and pre-trained transformer for protein solubility changes upon mutation

Protein mutations can significantly influence protein solubility, which results in altered protein functions and leads to various diseases. Despite of tremendous effort, machine learning prediction of protein solubility changes upon mutation remains a challenging task as indicated by the poor scores of normalized Correct Prediction Ratio (CPR). Part of the challenge stems from the fact that there is no three-dimensional (3D) structures for the wild-type and mutant proteins. This work integrates persistent Laplacians and pre-trained Transformer for the task. The Transformer, pretrained with hunderds of millions of protein sequences, embeds wild-type and mutant sequences, while persistent Laplacians track the topological invariant change and homotopic shape evolution induced by mutations in 3D protein structures, which are rendered from AlphaFold2. The resulting machine learning model was trained on an extensive data set labeled with three solubility types. Our model outperforms all existing predictive methods and improves the state-of-the-art up to 15%.


Introduction
Genetic mutations alter the genome sequence, leading to changes in the corresponding amino acid sequence of a protein.These alternations have far-reaching implications on the protein's structure, function, and stability, affecting attributes such as folding stability, binding affinity, and solubility.The consequences of protein mutations have been extensively studied in diverse fields such as evolutionary biology, cancer biology, immunology, directed evolution, and protein engineering [1].Understanding the impact of genetic mutations on protein solubility is crucial in various fields, including protein engineering, drug discovery, and biotechnology.Accurately analyzing and predicting the impact of mutations on protein solubility is therefore crucial in many fields, facilitating the engineering of proteins with desirable functions.There are numerous intricately interconnected factors impacting protein solubility, ranging from amino acid sequence arrangement, post-translational modifications, protein-protein interactions, to environmental conditions, such as solvent type, ion type and concentration, the presence of small molecules, temperature, etc.Unfortunately, the existing data set does not contain sufficient information.This complexity poses significant challenges for the accurate prediction and modeling of protein solubility, often requiring multifaceted computational approaches for reliable outcomes.
Computational predictions serve as a valuable complement to experimental mutagenesis analysis of protein stability changes upon mutation.Such computational approaches offer several advantages, including being economical, efficient, and provide a viable alternative to laborintensive site-directed mutagenesis experiments [2].As a result, the development of accurate and reliable computational techniques for mutational impact prediction could substantially enhance the throughput and accessibility of research in protein engineering and drug discovery.
Over the years, a variety of computational methods have been developed to explore the ef-fects of mutations on protein solubility, including but not limited to CamSol [3], OptSolMut [4], PON-Sol [5], SODA [6], Solubis [7], and others as summarized in a recent review [8].Cam-Sol employs an algorithm to construct a residue-specific solubility profile, although no explicit method has been made publicly available.OptSolMut is trained on 137 samples, each featuring single or multiple mutations affecting solubility or aggregation.PON-Sol utilizes a random forest model trained on a dataset of 406 single amino acid substitutions labeled as solubilityincreasing, solubility-decreasing, or exhibiting no change in solubility.SODA, which is based on the PON-Sol data, specifically targets samples with decreasing solubility [6].Solubis is an optimization tool that increases protein solubility and integrates interaction analysis from FoldX [2], aggregation prediction from TANGO [9], and structural analysis from YASARA [10].Recently, PON-Sol2 [11] extended the original PON-Sol dataset and employed a gradient boosting algorithm for sequence-based predictions.Despite of intensive effort, the current prediction accuracy in terms of normalized Correct Prediction Ratio (CPR) remains very low, calling for innovative strategies.
Topological data analysis (TDA) is a relatively new approach for data science.Its main technique is persistent homology [12,13].The essential idea of persistent homology is to construct a multiscale analysis of data in terms of topological invariants.The resulting changes of topological invariants over scales can be used to characterize the intricate structures of data, leading to an unusually powerful approach in describing protein structure, flexibility, and folding [14].
Persistent homology was integrated with machine learning for the classification of proteins in 2015 [15], which was one the first integrations of TDA and machine learning, and the predictions of mutation-induced protein stability changes [16,17] and protein-protein binding free energy changes [18,19].One of the major achievements of TDA is its winning of D3R Grand Challenges, an annual worldwide competition series in computer-aided drug design [20,21].A nearly comprehensive summary of the early success of TDA in biological science was given in a review [22].However, persistent homology only tracks the changes in topological invariants and cannot capture homotopic shape evolution of data over scales or induced by mutations.To overcome this limitation, Wei and coworkers introduced persistent combinatorial Laplacians, also called persistent spectral graphs, on point clouds [23] and evolutionary de Rham-Hodge method on manifolds [24] in 2019.The essence of these methods is the persistent topological Laplacians (PTLs) either on point clouds or on manifolds.PTLs not only fully capture the topological invariants in its harmonic spectra as those given by persistent homology, but also capture the homotopic shape evolution of data during the multiscale analysis or a mutation process.PTLs were applied to the predictions of protein flexibility [25] and protein-ligand binding free energies [26], protein-protein interactions [27,28], and protein engineering [1].The most remarkable accomplishment by persistent Laplacian is its accurate forecasting of emerging dominant SARS-CoV-2 variants BA.4 and BA.5 about two months in advance [29].
However, TDA approaches depend on the biomolcular structures, which may not be available.In fact, many proteins involved in the present study do not have 3D structures.In recent years, advanced natural language processing (NLP) models, including Transformers and long short-term memory (LSTM), have been widely implemented across various domains to extract protein sequence information.For example, Tasks Assessing Protein Embedding (TAPE) introduced three different architectures, namely transformer, dilated residual network (ResNet), and LSTM [30].Additionally, LSTM-based models like Bepler [31] and UniRep [32] have been developed.Additionally, large-scale protein transformer models trained on extensive datasets comprising hundreds of millions of sequences have made significant advancements in the field.These models, including Evolutionary Scale Modeling (ESM) [33] and ProtTrans [34,35], have exhibited exceptional performance in capturing a variety of protein properties.ESM, for instance, allows for fine-tuning based on either downstream task data or local multiple sequence alignments [36].In the present work, we leverage the pre-trained ESM transformer model to extract crucial information from protein sequences.
In this work, we will integrate transformer-based sequence embedding and persistent topological Laplacians to predict protein solubility changes upon mutation.While sequence-based models can be applied without 3D structural information, the PTL-based features require highquality structures.We generate 3D structures of wild type proteins from AlphaFold2 [37] to facilitate topological embedding.By combining both embedding approaches, they naturally complement each other in classifying protein solubility changes upon mutation.These embeddings are fed into an ensemble classifier, gradient boosted trees (GBT), to build a machine learning model, called TopLapGBT.We validate TopLapGBT on the classification of protein solubility changes upon mutation.We demonstrate that this integrated machine learning model gives rise to a substantial improvement as compared to existing state-of-the-art models.
Residue-Similarity plots are also applied to assess how well the TopLapGBT model classify three solubility labels.

Overview of TopLapGBT
TopLapGBT integrates both structure-based and sequence-based features, derived from protein structures and sequences respectively, into a unified model.Our architecture comprises three distinct embedding modules: persistent Laplacian-based embeddings, sequence-based embeddings, and auxiliary feature embeddings, all of which feed into an ensemble classifier as depicted in Figure 1.
In the persistent Laplacian-based feature embedding module, we employ persistent Laplacian techniques to generate features that encapsulate the structural attributes of proteins both pre-and post-mutation.This approach is particularly effective in capturing the structural alter-Figure 1: The illustration of the workflow for TopLapGBT.Protein sequences are first preprocessed by AlphaFold 2 to generate wild type protein structures.Mutant proteins are generated from the Jackal software [38].The structure-based features from persistent Laplacian, auxiliary and sequence-based features are then concatenated to form a long feature input for gradient boosting tree to classify the protein solubility changes upon mutation.The predicted labels are also analyzed on Residue-Similarity (R-S) plots.ations induced by mutations within the localized neighborhoods of the mutation sites.Mathematically, the persistent Laplacian builds a sequence of simplicial complexes through a filtration process, thereby characterizing atom-atom interactions across multiple scales (details in the Methods section).In the sequence-based feature embedding module, a pre-trained transformer model generates latent feature vectors extracted from protein sequences.Specifically, the transformer model used here is a 650M-parameter protein language model, trained on a corpus of 250M protein sequences spanning multiple organisms [39].Finally, the auxiliary feature embedding module incorporates a variety of attributes such as surface area, partial charge, pK a shifts, solvation free energy, and secondary structural information, synthesized from both protein sequences and structures.These three distinct sets of feature embeddings are subsequently concatenated to produce a comprehensive feature vector.This vector is then fed into a gradient-boosting tree classifier to categorize the mutation-induced samples.

Performance of TopLapGBT on PON-Sol2 dataset
In our study, we utilize the dataset employed by PON-Sol2 as detailed in [11].The dataset is comprised of 6,328 mutation samples, originating from 77 distinct proteins.These samples are categorized into three labels: decrease in solubility, increase in solubility, and no change in solubility.Specifically, the dataset contains 3,136 samples demonstrating a decrease in solubility, 1,026 samples showing an increase, and 2,166 samples with no change.Notably, the dataset exhibits a class imbalance, with a ratio of 1 : 0.69 : 0.34, indicating a bias towards samples that exhibit a decrease in solubility.To assess the performance of our model, we initially carry out a random 10-fold cross-validation on the dataset.Subsequently, an independent blind test prediction is executed to provide further validation of the model's efficacy.
In Table 1, we present a comparative analysis of the performance of existing classifiers by PON-Sol [5] and PON-Sol2 [11] against our proposed model, TopLapGBT, using 10-fold cross-validation.It should be noted that PON-Sol2 incorporates feature selection techniques such as recursive feature elimination (RFE).To provide a robust assessment of TopLapGBT's performance, we conduct 10 repeated runs, and the mean values of these runs are reported to account for any randomness in the model's output.
Performance evaluation of our model, TopLapGBT, is conducted using a range of metrics, including Positive Predictive Value (PPV), Negative Predictive Value (NPV), Sensitivity, Specificity, Correct Prediction Ratio (CPR), and Generalized Squared Correlation (GC 2 ).PPV and NPV quantify the proportions of correct positive and negative predictions for each solubility class, respectively.Given that we are dealing with a K-class problem with three distinct solu-Table 1: Comparison of performance metrics between TopLapGBT and both single layer and double layer classifiers of PON-Sol2 in the 10-fold crossvalidation.The negative solubility samples are denoted as "-" whereas the positive solubility change samples are denoted as "+".The samples with no solubility change are denoted as "N".Performance metrics include the positive predicted values (PPV), negative predicted values (NPV), sensitivity, specificity, correct prediction ratio (CPR) and generalised correlation (GC 2 ).PPV refers to the proportions of positive predictions for each solubility class while NPV refers to the proportions of negative predictions for each solubility class.CPR calculates the percentage of correctly classified samples while GC 2 measures the correlation coefficient of the classification.All normalized metrics are also reported.For each metric, the first value is without normalization while the second one is with normalization.performance [40].Specifically, CPR measures the overall accuracy of the model, while GC 2 quantifies the correlation coefficient of the classification, ranging from 0 to 1. Larger values for these metrics denote better performance.Importantly, due to the class imbalance in the number of mutation samples across the categories, all performance metrics are normalized to ensure a robust and reliable evaluation of the model's efficacy (further details are elaborated in the Methods section).
The proposed model, TopLapGBT, demonstrates significant performance gains over exist-ing featurization methods in PON-Sol2 across all evaluation metrics [11].Specifically, normalized CPR and GC 2 scores of TopLapGBT stand at 0.688 and 0.361, marking improvements of 4.88% and 15.71% over PON-Sol2, respectively.These gains underscore the merit of incorporating both structure-based and sequence-based features into the model.To elucidate the contribution of Persistent Laplacian (PL)-based features, we also present a comparative analysis with our TopGBT model in Table 1.The TopGBT model utilizes persistent homology-based embeddings alongside auxiliary and pre-trained transformer features.While TopGBT still outperforms all existing PON-Sol2 models, the incorporation of PL-based features in TopLapGBT leads to an incremental improvement of 1% and 2% in CPR and GC 2 metrics, respectively.This validates our approach of leveraging Persistent Laplacian to comprehensively capture both the topological and homotopic nuances in the evolution of protein structures.

Performance of TopLapGBT on independent test set
To robustly assess the performance of TopLapGBT, we subjected it to an independent test using the same dataset employed by PON-Sol2 [11].In this validation, TopLapGBT consistently outperformed all five existing models, as evidenced in Table 2. Specifically, TopLapGBT registers a normalized CPR of 0.564 and a normalized GC 2 of 0.185, surpassing PON-Sol2 by 3.49% and 17.83%, respectively.Relative to TopGBT, the inclusion of PL-based features in TopLapGBT yielded incremental gains in both CPR and GC 2 metrics, thereby further substantiating the utility of Persistent Laplacian in capturing the homotopic shape evolution within protein structures.

Discussion
The performance of machine learning models generally relies on the nature of the input features.In our model, the PL-based features depend on one main element which is the quality of the protein structures from AlphaFold 2 (AF2).The quality of AF2 structures are crucial to achieve comparable performance to nuclear magnetic resonance (NMR) structures while ensemble methods can be used to enhance the performance by combining multiple NMR structures [1].This allows AF2 structures to serve as a practical substitute for experimental structural data.Although AF2 structures are not as reliable as X-ray structures, the fusion of sequencebased pre-trained transformer features and PL-based features provides robust featurization even for low quality AF2 structural data.PL elucidates the precise mutation geometry and topology, while sequence-based pre-trained transformer features capture evolutionary patterns from an extensive sequence library.This synergy holds significance and can be applied to a diverse range of other challenges in the field of biomolecular research.For the rest of this section, we analyze the model's performance based on the region of the mutations and the type of muta-tions.We also discuss the performance of different feature types using the Residue-Similarity plots.

Performance analysis based on different mutation regions
To delve deeper into the model's performance, we categorize mutation samples based on their structural regions: interior and surface, as depicted in Figure 2 pre-and post-mutations.These regions are defined by their relative accessible solvent area (rASA), using a cutoff value c.
A residue at the mutation site is classified as buried or interior if its rASA falls below this cutoff.While the discrete nature of c initially raised concerns, given that amino acids have a continuous exposure profile, empirical analyses on databases from organisms like Escherichia coli, Saccharomyces cerevisiae, and Homo sapiens have shown that an optimal rASA cutoff of approximately 25% is effective for distinguishing between surface and interior residues [41].In our analysis, we apply this framework to identify surface and interior residues in the solubility dataset.We observe that some mutation sites undergo a regional transition, moving from one structural domain to another, consequent to the mutation.
To gain nuanced insights into TopLapGBT's performance, we segment the results according to the mutation's structural location within the protein.We present these segmentations as heatmap plots that delineate both mutation regions and amino acid types.Structural regions are defined based on relative accessible surface area (rASA) [41].By categorizing residues as either interior or surface, we can examine the influence of continuous amino acid exposure on solubility change classification post-mutation.

Performance analysis based on different mutation types
Switching focus to mutation types, our model's capability in classifying solubility changes also merits exploration across the 20 distinct amino acid types in the dataset.In addition to this, we subgroup amino acids as charged, polar, hydrophobic, or special case.Table S1 enumerates the sample counts for each mutation group pair.Figure 3

Feature analysis based on Residue-Similarity plots
The Residue Similarity Index (RSI) serves as a potent metric for evaluating the efficacy of dimensionality reduction in both clustering and classification contexts [43].RSI has proven its value in generating classification accuracy scores that align well with supervised methods in single-cell typing.When applied to our solubility change dataset, Residue-Similarity (R-S) plots can be constructed to scrutinize how the Residue Index (RI) and Similarity Index (SI) may indicate the quality of cluster separation.TopLapGBT, we contrast the R-S plots with UMAP visualizations (shown in Figure S2).It becomes evident that UMAP plots fail to form clusters that are as distinct as those observed in R-S plots, thereby reinforcing the need for a specialized approach to classify mutation samples effectively.
The impetus for utilizing structure-based features stems from the multifaceted relationship that exists among protein sequence, structure, and solubility.Factors such as hydrophobicity, charge distribution, and intermolecular interactions contribute to the complexity of protein solubility.Traditional prediction methods, which often rely on empirical rules or rudimentary descriptors, fall short in capturing this intricate molecular interplay.By employing advanced mathematical techniques like persistent Laplacian (PL) coupled with machine learning algorithms, we can decipher the complex patterns and relationships embedded within protein sequences and structures.Persistent Laplacian, in particular, provides a robust mathematical representation that captures both the topological and homotopic evolution of protein structures.
Furthermore, machine learning models rooted in advanced mathematics offer several advantages for classifying changes in protein solubility.These models are well-suited for handling high-dimensional and complex data sets, such as those involving protein sequences and structures.They are also capable of learning non-linear relationships and capturing nuanced dependencies that are often overlooked by traditional linear models.Importantly, these advanced models can adeptly manage class-imbalanced datasets, which are commonly encountered in protein solubility studies.

Conclusion
In the multifaceted quest to understand mutation-induced solubility changes, various scientific domains including quantum mechanics, molecular mechanics, biochemistry, biophysics, and molecular biology have made significant contributions.Despite these concerted efforts, state-of-art models have limitations, as evidenced by their normalized CPR value of 0.656 even after employing feature selection methods.Persistent homology (PH) has emerged as a powerful tool for capturing the complexity of biomolecular structures and has achieved noteworthy success in drug discovery applications.However, its inability to capture the nuances of homotopic shape evolution, crucial for delineating molecular interactions in proteins, presents a critical shortcoming.
Our study introduces TopLapGBT, a novel model that integrates persistent Laplacian (PL) features with pretrained transformer features, thereby bridging the gap in capturing both topology and homotopic shape evolution.This innovative fusion leads to significant advancements in classification performance.Specifically, TopLapGBT achieves normalized CPR and GC 2 scores of 0.688 and 0.361, respectively, marking improvements of 4.88% and 15.71% over the state-of-the-art PON-Sol2.These findings are further corroborated by an independent blind test, where TopLapGBT continues to outperform existing models.
In summary, our proposed TopLapGBT model not only achieves superior performance over existing state-of-the-art methods but also introduces a more nuanced approach for the classification of protein solubility changes upon mutation.These results underscore the transformative potential of integrating geometric and topological features with machine learning in advancing the field of molecular biology.

Materials and Methods
In this section, we endeavor to elucidate key mathematical and computational foundations that are instrumental for the work presented in this study.Specifically, we delve into spectral graph theory, simplicial complex, and persistent Laplacian methods, highlighting their significance in capturing topological and spectral properties essential for the characterization of proteins.
Additionally, we discuss machine learning and deep learning paradigms, focusing on their role in processing, analyzing, and interpreting these complex features, especially within the confines of test datasets and validation settings.

Persistent Laplacian characterization of proteins
Simplicial complex A simplicial complex is made up of a set of simplices and generalises beyond graph networks at higher dimensions [44,45,46,47].Every simplex is a finite set of vertices which can be interpreted as the atoms in a protein structure.Essentially, simplices can be a point (0-simplex), an edge (1-simplex), a triangle (2-simplex), a tetrahedron (3-simplex), or in higher dimensions, a p-simplex.In other words, a k-simplex A geometric simplicial complex K is a finite set of geometric simplexes that satisfy two essential conditions.First, any face of a simplex from K is also in K. Second, the intersection of any two simplexes in K is either empty or shares faces.Commonly used methods to construct simplicial complexes are Čech complex, Vietoris-Rips complex, Alpha complex, Clique complex, Cubic complex, and Morse complex [44,45,46,47].
Chain Group A k-th chain group C k is a free Abelian group generated by oriented k-simplices where -simplex, which is constructed by the all the vertices except v i , i.e., removing v i from the simplex.The boundary operator satisfies The adjoint of ∂ k , which is This will be used in the combinatorial Laplacian.
Combinatorial Laplacian For the k-boundary operator More specifically, let m and n be the number of (k − 1)-simplices and p-simplices respectively in a simplicial complex K.The m × n boundary matrix B k has entries defined as follows: where 1 ≤ i ≤ m and 1 ≤ j ≤ n.Here, Then the k-combinatorial Laplacian or the topological Laplacian is a linear operator ∆ k : The k-combinatorial Laplacian exhibits an n × n matrix representation L k and is given by In the case k = 0, then The number of rows in B k represents the number of (k − 1)-simplices in K and the number of columns refers to the number of k-simplices in K. Furthermore, the upper k-combinatorial 1 with B 0 being a zero matrix and K being an oriented simplicial complex of dimension 1.In fact, the 0-combinatorial Laplacian matrix L 0 (K) is actually the graph Laplacian in spectral graph theory.
The above graph Laplacian matrices can be explicitly described in terms of the simplex relations.More precisely, L 0 can be described as which is equivalent to the graph Laplacian.Furthermore, when k > 0, L k can be expressed as Here, we denote σ k−1 j ∼ σ k i if they have the same orientation, i.e. similarly oriented.Furthermore, we say that two k-simplices σ k i and σ k j are upper adjacent (resp.lower adjacent) neighbors, denoted as σ k i ⌢ σ k j (resp.σ k i ⌣ σ k j ), if they are both faces of a common (k + 1)simplex (resp.they both share a common (k − 1)-simplex as their face).In addition, if the orientations of their common lower simplex are the same, it is called similar common lower simplex (σ k i ⌣ σ k j and σ k i ∼ σ k j ).On the other hand, if the orientations are different, it is called dissimilar common lower simplex (σ k i ⌣ σ k j and σ k i ≁ σ k j ).The (upper) degree of a k-simplex σ k i , denoted as d(σ k i ), is the number of (k + 1)-simplices, of which σ k i is a face.
The eigenvalues of combinatorial Laplacian matrices are independent of the choice of the orientation [48].Furthermore, the multiplicity of zero eigenvalues, i.e. the total number of This nested sequence of simplicial complexes induces a family of chain complexes where is the chain group for the subcomplex K t , and its k-boundary operator is In the case k < 0, then C k (K t ) is an empty set and ∂ t k is a zero map.
For 0 < k < dim(K t ), the boundary operator with The topological and spectral characteristics can then be studied from L k (K t ) by varying the filtration parameter and diagonalizing the k-combinatorial Laplacian matrix.The multiplicity of the zero spectra of L t k is the persistent Betti number β t k , which represents the number of k-dimensional holes in K t .In other words, In particular, β t 0 represents the number of connected components in K t , β t 1 counts the number of one-dimensional cycles in K t and β t 2 reveals the number of two-dimensional voids in K t .
In addition, the spectra of L t k can be written in the following ascending order where L t k here is an n × n matrix.The p-persistent k-combinatorial Laplacian can be extended based on the boundary operator as well.Further details can be found in [25].
In order to illustrate the difference between PL and PH, Figure 5 describes a point cloud, basic simplices, a filtration process and the comparison between persistent Laplacian and persistent homology barcodes of 13 points.The filtration process in Figure 5(c) shows the different stages of a Rips filtration process for the 13 points.Figure 5(d) shows the persistent homology barcodes (in blue) and persistent non-harmonic spectra (in red).It can be seen that the nonharmonic spectra provides the additional homotopic shape evolution that is missing in persistent homology in the later part of the filtration process.

Persistent Laplacian descriptors
In order to capture the mutation-induced solubility change, we apply the persistent Laplacian (PL) to characterize the interactions between the mutation site and the rest of the protein.
To describe these interactions, we first propose the interactive PL with the distance function DI(A i , A j ) describing the distance between two atoms A i and A j defined as where  [50,12] and the non-harmonic spectra of persistent Laplacians (PLs) [25] from the filtration process in (c).The x-axis represents the filtration parameter f .By discretising the filtration region into equal-sized bins and adding all the Betti bars together, the topological invariants are summarized into persistent Betti numbers that acts a topological descriptor extracted from protein structures.Persistent Laplacians (PLs) [25] for thirteen points.The first non-zero eigenvalues of dimension 0, λ 0 (r), and dimension 1, λ 1 (r), of PLs are depicted in red.The harmonic spectra of PLs return all the topological invariants of PH, whereas the non-harmonic spectra of PLs capture the additional homotopic shape evolution of PLs during the filtration that are neglected by PH.
The set of persistent spectra from each persistent Laplacian computation consists of V DI γ,α,β and V DE γ,α,β where γ ∈ {M, W } refers to the mutant protein or the wild type protein, α ∈ {C, N, O} is the atom type chosen in the rest of the protein and β ∈ {C, N, O} is the atom type chosen in the mutation site.V DI γ,α,β applies the distance DI-based filtration to generate 0-dimensional Laplacian using the Vietoris-Rips complex and V DE γ,α,β applies the Euclidean distance DE-based filtration to generate 1 and 2-dimensional Laplacian using the alpha complex.
In total, there are 54 sets of persistent spectra.The persistent spectra from PL contains both harmonic and non-harmonic spectra that are capable of revealing the molecular mechanism of protein solubility.For zero dimensions, we consider both the harmonic spectra and non-harmonic spectra in-formation for each persistent Laplacian.Filtration using Rips complex with DI distance is used.The 0-dimensional PL features are generated from 0 Å to 6 Å with 0.5 Å gridsize.For the non-harmonic spectra information, we count the number of non-harmonic spectra and calculate seven statistical values of non-harmonic spectra such as sum, minimum, maximum, mean, standard deviation, variance and the sum of eigenvalues squared.This generates eight statistical values for each of the nine atomic pairs.Therefore, the dimension of 0-dimensional PL features for a protein is 72.In total, the 0-dimensional PL-based feature size after concatenating features at different dimensions for wild type and mutant is 1872.
For one or two dimensions, we perform the filtration using Alpha complex with the DE distance.The limited number of atoms in the local protein structure can create only a few high-dimensional simplexes, resulting in minimal alterations in shape.As a result, it suffice to consider features from only harmonic spectra of persistent Laplacians by coding the topological invariants for the high-dimensional interactions.Using GUDHI [51], the persistence of the and two-dimensional PL feature vectors for a protein is 140.In total, the higher-dimensional PL-based feature size after concatenating features at different dimensions for wild type, mutant and their difference is 420.

Persistent Homology
Persistent homology is part of the harmonic spectra of PL.The homology groups in PH illustrate the persistence of topological invariants, hence providing the harmonic spectral information in PL.The site-and element-specific PH features are generated in a similar way as compared to PL.Similar to PL, filtration construction is also employed to PH.For the zero dimension, the filtration parameter can be discretized into several equally spaced bins, namely [0, 0.5], (0.5, 1], • • • , (5.5, 6] Å.The death value of the bars are summed in each bin resulting in 12×18 features.
For each bin, we count the numbers of persistent bars, resulting in a nine-dimensional vector for each point cloud.Similarly, this is performed for each of the nine single atomic pairs.Hence, the dimension of PH features for a protein is 216.For one or two dimensions, the identical featurization from the statistics of persistent bars in PH is used.The PH embedding combines features at different dimensions as described above and concatenated for wild type, mutant and their difference, resulting in a 648-dimensional vector.

Transformer Features
Recently, we have seen significant advancements in modelling protein properties using largescale protein transformer models trained on hundreds of millions of sequences.These models, like ESM [33] (evolutionary scale modeling) and ProtTrans [34,35], have demonstrated impressive performance.Moreover, hybrid fine-tuning approaches that leverage both local and global evolutionary data have proven to enhance these models even further.
Furthermore, specificity and sensitivity can be computed by the following: The correct prediction ratio (CPR) and generalized squared correlation (GC 2 ) are used to evaulate the overall performance of TopLapGBT.CPR and GC 2 can be computed as where K is the number of classes and N is the number of samples.Here, z ij represents the number of samples of class i to class j.Let x i = j z ij be the number of inputs from class i, and y j = i z ij be the number of inputs predicted to class j.Then the expected number of samples in (i, j)-th entry of the multiclass confusion matrix is Since the mutational samples across the three solubility classes are imbalanced, we normalized the values to provide more reliable calculation of performance metrics.

Software and resources
Protein sequences are first preprocessed by AlphaFold 2 to generate wild type protein structures.In particular, 3D protein structures are generated from protein sequences using Colab-Fold [52].Mutant proteins are generated from the Jackal software [38].All TopLapGBT models are built using the sklearn machine learning library [53].The hyperparameters for all the TopLapGBT are: n estimators = 20000, learning rate = 0.05, max depth = 7, subsample=0.4,min sample split = 3 and max features = sqrt.The PQR files, which contains the partial charge information of the proteins, are generated from the PDB2PQR software [54].The PQR files for both the wild type proteins are generated with AMBER force field.The solvation energy and surface area information are calculated from the in-house online software package ESES [55] and MIBPB [56].The pKa values are computed from the PROPKA software package [57].The position-specific-scoring matrices (PSSM) are computed from the BLAST+ software [58] using the nr database.The secondary structure features and torsion angle sequence-based information are calculated from SPIDER [59].The persistent Laplacian descriptors for both VR complexes and alpha complexes are calculated using the GUDHI software library [60].All computational work in support of this research was performed using the resources from the National Super

Figure 2 (
b) displays accuracy scores for four types of mutations: interior-interior, interior-surface, surface-interior, and surface-surface.TopLapGBT attains an average accuracy score of 0.770 across these categories.Extended data in FigureS1further breaks down accuracy scores for all 20 distinct amino acids within each region-pair, revealing variations in residue-residue pair performance.

Figure 2 :
Figure 2: (a) The definitions of the structural regions on the protein label 213133708 with mutation ID: I283W.For both wild type and mutant type, amino acids in the proteins are classified under surface or interior regions based on the rASA of the residue.The residue ID 283 of protein label 213133708 was mutated from isoleucine (interior region) to trytophan (surface region).Structures are plotted with the software Illustrate[42].(b) A comparison of performance of TopLapGBT among different mutation region types.The y-axis represents the region type for the original residue and the x-axis represents the region type for the mutated residue.The numbers indicated in each cell corresponds to the number of mutation samples in each regionregion mutation pair.The accuracy scores (CPR) for both interior-interior and interior-surface are 0.813 and 0.812 while the accuracy score for both surface-interior and surface-surface are 0.725 and 0.730.
(a) displays accuracy scores for each mutation group pair, while Figure3(b) shows scores for each amino acid pair.Notably, the special-charged and special-polar groups register the highest accuracy, whereas the polarhydrophobic and polar-special groups underperform.One plausible reason could be the inherent complexity in accurately classifying mutations with non-negative solubility changes.It's worth noting that PON-Sol2 employed a two-layer classifier to improve classification[11].Our results indicate that TopLapGBT surpasses the performance of this two-layer system.

Figure 3 :
Figure 3: A comparison of 10-fold cross validation accuracy scores (CPR) for (a) different mutation groups and (b) its associated amino acid types.The y-axis labels the residue type of the original protein, whereas the x-axis labels the residue type of the mutant.The squares colored in black in (b) have zero mutation samples.For a reverse mutation, the labels are taken with reverse solubility change unless the change is zero.

Figure 4
Figure 4 juxtaposes the R-S plots derived from TopLapGBT against those from various feature sets utilized in model training.Across all visualizations, samples manifest a range of classification outcomes-both correct and incorrect-for each true label.However, a noteworthy observation is that Transformer-pretrain and persistent Laplacian-based features demon-

Figure 4 :
Figure 4: The comparison of R-S plots between the different types of features used in TopLapGBT model.The y-axis represents the residue score, whereas the x-axis represents the similarity score.RS scores were computed for the testing set, and all 10-folds were visualized.Each section corresponds to one of the 3 true solubility labels, and the sample's color and marker correspond to the predicted label from TopLapGBT.
define B k to be an m × n matrix representation of the boundary operator under the standard bases {σ k i } n i=1 and {σ k−1 j } m j=1 of C k and C k−1 .Similarly, the matrix representation of ∂ * k is the transpose matrix B ⊤ k , with respect to the same ordered bases of the boundary operator ∂ k .

1 Figure 5 :
Figure 5: The illustration of (a) point cloud, (b) basic simplices and (c) the filtration process and (d) Comparison between PH barcodes[50,12] and the non-harmonic spectra of persistent Laplacians (PLs)[25] from the filtration process in (c).The x-axis represents the filtration parameter f .By discretising the filtration region into equal-sized bins and adding all the Betti bars together, the topological invariants are summarized into persistent Betti numbers that acts a topological descriptor extracted from protein structures.Persistent Laplacians (PLs)[25] for thirteen points.The first non-zero eigenvalues of dimension 0, λ 0 (r), and dimension 1, λ 1 (r), of PLs are depicted in red.The harmonic spectra of PLs return all the topological invariants of PH, whereas the non-harmonic spectra of PLs capture the additional homotopic shape evolution of PLs during the filtration that are neglected by PH.

1 Figure 6 :
Figure 6: An illustration of interactive PL showing hydrophillic interactions based on DI-based filtration (left) and hydrophobic interactions based on DE-based filtration (right) at a mutation site.(a) Hydrophillic interactions between the nitrogen atoms (red) and oxygen atoms (yellow).(b) Hydrophobic interactions between the carbon atoms (cyan) and oxygen atoms (dark blue).The hexagon ring is colored in red and the triangle is colored in yellow.(c) The PH barcodes and PL for dimension 0 of the hydrophillic interactions in (a).(d) The PH barcodes and PL for dimension 0 and dimension 1 of the hydrophobic interactions in (b).The Betti-1 bar is due to the red hexagon ring in (b).
harmonic spectra can be represented by persistent barcodes.The topological feature vectors are generated by computing the statistics of bar lengths, births and deaths.Bars shorter than 0.1 Å are excluded as they do not exhibit any clear physical meaning.The remaining bars are then used for computing the statistics: (1) sum, maximum and mean for lengths of bars; (2) minimum and maximum for the birth values of bars; (3) minimum and maximum for the death values of bars.Each set of point clouds leads to a seven-dimensional vector.These features are calculated on nine single atomic pairs and one heavy atom pair.The dimension of one- For instance, eUniRep is an improved LSTM-based UniRep model achieved through fine-tuning with knowledge extracted from local multiple sequence alignments (MSAs).Additionally, the ESM model can be fine-tuned using either downstream task data or local MSAs.In our research, we employed the ESM-1b transformer, a model that falls under the transformer architecture.This particular variant was trained on a dataset of 250 million sequences using a masked filling procedure and boasts an architecture comprising 34 layers with a whopping 650 million parameters.The ESM transformer's primary role in our work was to generate sequence embeddings.At each layer of the ESM model, it encoded a sequence of length L into a matrix sized at 1,280×L, excluding the start and terminal tokens.For our study, we utilized the sequence representation derived from the final (34th) layer and computed the average along the sequence length axis, resulting in a 1,280-component vector.5.5 Performance MetricsPPV and NPV assesses the true positive and true negative proportion of the predicted results for each solubility class.PPV and NPV are computed based on TP, TN, FP and FN which represents the true positive, true negative, false positive and false negative values for each solubility class.For each solubility class, PPV and NPV can be computed by:

Table 2 :
Performance of TopLapGBT with existing state-of-the-art models on the independent blind test classification.The negative solubility samples are denoted as "-" whereas the positive solubility change samples are denoted as "+".The samples with no solubility change are denoted as "N".Performance metrics include the positive predicted values (PPV), negative predicted values (NPV), sensitivity, specificity, correct prediction ratio (CPR) and generalised correlation (GC 2 ).PPV refers to the proportions of positive predictions for each solubility class while NPV refers to the proportions of negative predictions for each solubility class.CPR calculates the percentage of correctly classified samples while GC 2 measures the correlation coefficient of the classification.All normalized metrics are also reported.For each metric, the first value is without normalization while the second one is with normalization.
in determining the performance of TopLapGBT.Recently, AF2 structures have been reported